@yanboshao yanboshao commented Jan 26, 2026

Motivation

If the kernel can obtain the output pointers of the other ranks, stage 2 can write data directly into the remote ranks' output buffers, saving one pass through local HBM.

Technical Details

In graph mode, each rank's output pointer is broadcast on the host side, so the kernel can address the remote outputs directly.
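The host-side exchange and the stage-2 direct write can be sketched in pure Python (function names and data layout here are hypothetical illustrations, not the PR's actual HIP kernel, which exchanges raw device pointers and writes from the GPU):

```python
# Minimal pure-Python model of the optimization. Host-side "broadcast":
# every rank publishes its output buffer once, before graph capture.
# Stage 2 then writes each reduced chunk directly into every remote
# output instead of staging it in local HBM and gathering afterwards.

def broadcast_output_ptrs(outputs):
    """Stand-in for the host-side broadcast: each rank learns every
    rank's output buffer (in the real kernel, a raw device pointer)."""
    return list(outputs)  # world-visible table of output buffers

def stage2_write(my_chunk, chunk_idx, ptr_table):
    """The rank owning one reduced chunk writes it directly into the
    output buffer of every rank, its own included; this saves the
    extra local-HBM write of a write-local-then-allgather scheme."""
    for out in ptr_table:
        out[chunk_idx] = my_chunk

world_size = 4
# One output buffer per rank, with one chunk slot per rank.
outputs = [[None] * world_size for _ in range(world_size)]
ptr_table = broadcast_output_ptrs(outputs)

# Pretend stage 1 reduced chunk r to the value r * 10 on rank r.
for rank in range(world_size):
    stage2_write(rank * 10, rank, ptr_table)

# Every rank's output now holds the full reduced result.
assert all(out == [0, 10, 20, 30] for out in outputs)
```

The key design point modeled here is that the pointer table is built once on the host (safe under graph capture, since the addresses are fixed before replay), while the per-iteration work is only the direct remote writes.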

Test Plan

Test Result

Dtype: bf16
Device: MI308 × 8
CudaGraph: on

| Shape | Old (μs) | New (μs) | Ratio (%) |
|---|---|---|---|
| 440x5120 | 44.85 | 36.92 | 17.68 |
| 512x5120 | 46.00 | 43.54 | 5.35 |
| 512x7168 | 62.32 | 54.62 | 12.36 |
| 512x8192 | 66.25 | 61.50 | 7.16 |
| 632x5120 | 53.12 | 47.91 | 9.81 |
| 680x5120 | 58.03 | 54.04 | 6.87 |

Dtype: bf16
Device: MI325 × 8
CudaGraph: on

| Shape | Old (μs) | New (μs) | Ratio (%) |
|---|---|---|---|
| 440x5120 | 33.40 | 32.00 | 4.18 |
| 512x5120 | 38.19 | 37.77 | 1.11 |
| 512x7168 | 53.33 | 51.41 | 3.61 |
| 512x8192 | 57.32 | 55.91 | 2.47 |
| 632x5120 | 45.45 | 43.91 | 3.38 |
| 680x5120 | 48.68 | 46.94 | 3.57 |

Dtype: bf16
Device: MI355 × 8
CudaGraph: on

| Shape | Old (μs) | New (μs) | Ratio (%) |
|---|---|---|---|
| 440x5120 | 24.91 | 24.98 | -0.25 |
| 512x5120 | 28.47 | 28.55 | -0.27 |
| 512x7168 | 38.49 | 38.71 | -0.57 |
| 512x8192 | 42.71 | 42.64 | 0.17 |
| 632x5120 | 34.12 | 33.95 | 0.50 |
| 680x5120 | 36.33 | 36.33 | 0.00 |

Submission Checklist

@yanboshao yanboshao requested a review from a team January 26, 2026 15:12
@yanboshao yanboshao changed the title optimize allreduce write mode by broadcast output addr optimize allreduce write mode by broadcast output ptr Jan 26, 2026

@valarLip valarLip left a comment

LGTM

@valarLip valarLip self-assigned this Jan 30, 2026
